Technology overview

Affymetrix microarrays are a popular commercial platform available for a wide range of genomics applications (gene expression profiling, SNP genotyping, ChIP-chip analysis etc.) in different species. The main distinction between Affymetrix and other array technologies is the use of many short (25mer) probes to measure hybridisation. In this practical, we explore the basic analysis issues for Affymetrix GeneChips which measure gene expression using multiple (11-20) perfect match (PM) and mismatch (MM) probes concentrated in the 3’ region of each transcript.

Despite our focus on expression arrays, many of the processing steps, such as quality assessment and normalisation are equally important for other applications of Affymetrix technology to ensure that the signal is comparable between arrays. In this practical we use several Bioconductor packages (affy, affyPLM, limma, etc.) to read in the raw data, assess data quality and normalise an Affymetrix experiment. Raw data for Affymetrix arrays are usually stored in .cel format. We wouldn’t normally wish to view these files directly. These files contain the intensities of each “cell” on chip surface after image analysis.

  • Each hybridisation has a unique cel file
  • It is know in advance which probe / probe set is located within each cell
  • Each chip type (e.g.hgu95av2) has a cdf (chip description) file which is required to decode which probe and probe set is within each cell.

Example analysis in R

The estrogen dataset

The experiment we will analyse is made-up of eight Affymetrix HGU95Av2 GeneChips. The aim of the experiment is briefly described below (excerpt taken from the factDesign package vignette).

“The investigators in this experiment were interested in the effect of estrogen on the genes in ER+ breast cancer cells over time. After serum starvation of all eight samples, they exposed four samples to estrogen, and then measured mRNA transcript abundance after 10 hours for two samples and 48 hours for the other two. They left the remaining four samples untreated, and measured mRNA transcript abundance at 10 hours for two samples, and 48 hours for the other two. Since there are two factors in this experiment (estrogen and time), each at two levels (present or absent, 10 hours or 48 hours), this experiment is said to have a 2 × 2 factorial design.”

The data for this section are described in the estrogen data package in Bioconductor

The cel files for the example experiement are stored in the estrogen directory

The first stage is to read the targets file. In Bioconductor we often use a targets file to define the files corresponding to the raw data for the experiment and which sample groups they belong to. Such a file can be created in a spreadsheet, or text editor, and usually saved as a tab-delimited file. We can have as many columns in the file as we see fit, and one row for each hybridisation. The sample group information is propogated through the analysis and incorporated in the quality assessment and eventually differential expression. We refer to this as the “phenotype data” for the experiment, or sometimes the “metadata”.

Packages and data

If you did not install the affy Bioconductor package, you will need to do so now:-

source("http://www.bioconductor.org/biocLite.R")
biocLite("affy")

The data for the practical can be found in the course zip file

library(affy)
targetsFile <- "estrogen/estrogen.txt"
targetsFile
[1] "estrogen/estrogen.txt"
pd <- read.AnnotatedDataFrame(targetsFile,header=TRUE,sep="",row.names=1)
pData(pd)

As .cel files are not a typical file format, we cannot use the standard R functions such as read.delim, read.csv and read.table. The function to import the data is ReadAffy from the affy package.

raw <-ReadAffy(celfile.path = "estrogen", filenames=rownames(pData(pd)),phenoData = pd)
raw
replacing previous import ‘AnnotationDbi::tail’ by ‘utils::tail’ when loading ‘hgu95av2cdf’replacing previous import ‘AnnotationDbi::head’ by ‘utils::head’ when loading ‘hgu95av2cdf’
AffyBatch object
size of arrays=640x640 features (19 kb)
cdf=HG_U95Av2 (12625 affyids)
number of samples=8
number of genes=12625
annotation=hgu95av2
notes=

What type of Affy array is this How many features? How many samples?


Diagnostic plots

As with other high-throughput technologies, quality assessment is an important part of the analysis process. Data quality can be checked using various diagnostic plots.

The first diagnostic plot we will meet is the “boxplot”, which is a commonly-used plot in data analysis for comparing data distributions. The exact definition of how is boxplot is drawn can vary, the definitions always hold;

  • The bottom of the box is the first quartile; the 25th percentile
  • The top of the box is the third quartile; the 75th percentile
  • The median is represented by a horizontal line
  • The inter-quartile range (IQR) is the difference between the 75th and 25th percentiles

Here, we draw a boxplot to compare two distributions; x with a mean of 0 and standard deviation of 1, and y with a mean of 2 and standard deviation of 4.

x <- rnorm(100)
y <- rnorm(100, mean = 2,sd=4)
boxplot(x,y)

It is also common to draw whiskers that extend to 1.5 X the IQR from the lower and upper quartile. Any points outside this range can be represented by a dot.

In Bioconductor, you will usually find that package authors have created shortcuts to allow complicated data types to be visualised with common functions. For instance, if we want to construct a boxplot from our raw Affymetrix data it would be quite a daunting task for the novice to extract the revelant information from the object. Instead, we can use functions that we are familiar with such as boxplot. The plot can also be customised in ways that we are familiar with such as changing the color (the col argument) and label orientation (las=2 to make labels perpendicular to the axis)

boxplot(raw,col="red",las=2)


What do you notice from this plot?


Perfect-Match and Mismatch probes

With the short DNA sequences used for microarray hybridisation, there is the possibility for the sequence designed to not be specific-enough and hybridise to many regions of the genome. Affymetrix attempt to resolve this issue by having a mismatch sequence corresponding to each probe. Ideally, the subtracting the mismatch from the corresponding perfect match( \(PM - MM\)) should yield a background-corrected, and more-reliable, signal.


Generate histograms of the PM and MM intensities from the first array. Do you notice any difference in the signal distribution of the PMs and MMs?


par(mfrow=c(2,1))
hist(log2(pm(raw[,1])),breaks=100,col="steelblue",main="PM",xlim=c(4,14))
hist(log2(mm(raw[,1])),breaks=100,col="steelblue",main="MM",xlim=c(4,14))

M-A plots are a useful way of comparing the red and green channels from a two-colour microarray. For Affymetrix data, which is a single-channel technology, M and A values can be calculated using the intensities from a pair of chips. We would typically produce a plot using samples from the same biological group, where we would not expect to observe much difference in intensity for a given gene.

i.e. if X\(_i\) is the intensity of a given probe from chip \(i\) and X\(_j\) is the intensity for the same probe on chip \(j\), then \(A = 1/2 (log_2(X_i) + log_2(X_j))\) and \(M = log_2(X_i) − log_2(X_j)\) . In this experiment, there are 8 GeneChips, which gives 28 distinct pair-wise comparisons between arrays. We will focus on a subset of these below.

mva.pairs(pm(raw)[,1:4],plot.method="smoothScatter")

mva.pairs(pm(raw)[,5:8],plot.method="smoothScatter")


Based on all the plots you have generated, what would you conclude about the overall quality of this experiment? ******

Probe-level Linear Models

Probe-level Linear Models (PLMs) can be used as an additional tool to assess relative data quality within an experiment. Many different model specifications are possible, with the simplest fitting chip, and probe effects to the log\(_2\) intensities within each probeset across an experiment in a robust way. The output is a matrix of residuals, or weights for each chip which can be used as an additional diagnostic; systematically high residuals across an entire array, or a large fraction of an array is indicative of an outlier array. The main use of this tool is in deciding whether or not to keep an array in the down-stream data analysis.

Relative Log Expression (RLE)

The Relative Log Expression (RLE) values are computed by calculating for each probe-set the ratio between the expression of a probe-set and the median expression of this probe-set across all arrays of the experiment. It is assumed that most probe-sets are not changed across the arrays, so it is expected that these ratios are around 0 on a log scale. The boxplots presenting the distribution of these log-ratios should then be centered near 0 and have similar spread. Other behavior would be a sign of low quality.

Normalized Unscaled Standard Error (NUSE)

The Normalized Unscaled Standard Error (NUSE) is the individual probe error fitting the Probe-Level Model (the PLM models expression measures using a M-estimator robust regression). The NUSE values are standardized at the probe-set level across the arrays: median values for each probe-set are set to 1. The boxplots allow checking (1) if all distributions are centered near 1 (typically an array with a boxplot centered around 1.1 shows bad quality) and (2) if one array has globally higher spread of NUSE distribution than others, which may also be a sign of low quality.

library(affyPLM)
Loading required package: gcrma
Loading required package: preprocessCore
plmset <- fitPLM(raw)
NUSE(plmset,las=2)

RLE(plmset,las=2)

We can look at the images of the array surface. Ocassionally this can reveal problems on the array. See this gallery of interesting examples. Affymetrix, and other older arrays are vunerable to spatial artefacts as the same probe set is found in the same location on each chip.

If you want to look at an example of a bad array image, we can use the following code;

bad <- ReadAffy(celfile.path = "estrogen/",filenames="bad.cel")
image(bad)

Try out some images for this dataset

par(mfrow=c(2,4))
image(raw[,1])
image(raw[,2])
image(raw[,3])
image(raw[,4])
image(raw[,5])
image(raw[,6])
image(raw[,7])
image(raw[,8])

  • Do you see any problems?

If we are happy with the quality of the raw data we can proceed to the next step. So far in the data, we have multiple probes within a probeset, and a perfect and mismatch measurement for each probe. These are not particular covenient values for statistical analysis as we would like a single measurement for each gene for each sample (/hybridisation).

Summarising and Normalising the Estrogen dataset

Many normalisation and summarisation methods have been developed for Affymetrix data. These include MAS5.0 and PLIER which were developed by Affymetrix, and RMA, GC-RMA, dChip and vsn (to name but a few) which have been developed by academic researchers. Many of these methods are available in the affy package. For a comparison of some of these methods and assessment of their performance on different control data sets, see Bolstad et al or Millenaar et al. In this practical we will use the RMA (Robust Multichip Averaging) method described in Irizarry et al.

Procedure;

  • model based background adjustment
  • followed by quantile normalisation and a
  • robust summary method (median polish) on the log\(_2\) PM intensities
  • to obtain probeset summary values.

Normalisation

  • We want to be observing biological and not technical variation
  • We wouldn’t expect such wholesale changes on a per-sample basis
  • Easy option would to scale values for each array to median level

  • What would happen if we had hybridised all the estrogen-treated samples on the first four arrays, and un-treated on the last four - Could you analyse the data?

  • Genes on array 2 are on average 2.6 lower than the global median, so add 2.6 to each gene
  • Genes on array 8 are on average 0.7 higher than the global median, so subtract 0.7 from each gene
  • etc

Non-linear effects

As we saw before The MA-plot is commonly-used in microarray analysis to commonly-used to visualise differences between pairs of arrays. These plots can reveal non-linear effects and motivate more-sophisticated methods. On older array technologies, it was typical to see a banana-shaped trend on these plots.

(not our dataset)

Quantile normalisation

This is arguably the most-popular normalisation method available. Consider the following matrix of values to be normalised

     Array1  Array2  Array3 
[1,] "Rank4" "Rank5" "Rank3"
[2,] "Rank2" "Rank4" "Rank2"
[3,] "Rank3" "Rank3" "Rank1"
[4,] "Rank5" "Rank1" "Rank4"
[5,] "Rank1" "Rank2" "Rank5"

Sort each column Smallest…Largest

Sorted data

apply(raw.values, 2,sort)
     Array1 Array2 Array3
[1,]   0.18   1.60   2.88
[2,]   0.72   1.65   3.12
[3,]   1.38   2.43   4.37
[4,]   2.19   2.92   4.93
[5,]   2.32   3.04   5.89

Then calculate target distribution by averaging the sorted rows

Rank1 Rank2 Rank3 Rank4 Rank5 
1.553 1.830 2.727 3.347 3.750 

Go back to the rank matrix

     Array1  Array2  Array3 
[1,] "Rank4" "Rank5" "Rank3"
[2,] "Rank2" "Rank4" "Rank2"
[3,] "Rank3" "Rank3" "Rank1"
[4,] "Rank5" "Rank1" "Rank4"
[5,] "Rank1" "Rank2" "Rank5"

Substitute with values from the target distribution

Rank1 Rank2 Rank3 Rank4 Rank5 
1.553 1.830 2.727 3.347 3.750 
ranked.values[,1] <- gsub("Rank1",target["Rank1"],ranked.values[,1])
ranked.values
     Array1  Array2  Array3 
[1,] "Rank4" "Rank5" "Rank3"
[2,] "Rank2" "Rank4" "Rank2"
[3,] "Rank3" "Rank3" "Rank1"
[4,] "Rank5" "Rank1" "Rank4"
[5,] "1.553" "Rank2" "Rank5"
ranked.values[,1] <- gsub("Rank2",target["Rank2"],ranked.values[,1])
ranked.values
     Array1  Array2  Array3 
[1,] "Rank4" "Rank5" "Rank3"
[2,] "1.83"  "Rank4" "Rank2"
[3,] "Rank3" "Rank3" "Rank1"
[4,] "Rank5" "Rank1" "Rank4"
[5,] "1.553" "Rank2" "Rank5"
ranked.values[,1] <- gsub("Rank3",target["Rank3"],ranked.values[,1])
ranked.values
     Array1  Array2  Array3 
[1,] "Rank4" "Rank5" "Rank3"
[2,] "1.83"  "Rank4" "Rank2"
[3,] "2.727" "Rank3" "Rank1"
[4,] "Rank5" "Rank1" "Rank4"
[5,] "1.553" "Rank2" "Rank5"
ranked.values[,1] <- gsub("Rank4",target["Rank4"],ranked.values[,1])
ranked.values
     Array1  Array2  Array3 
[1,] "3.347" "Rank5" "Rank3"
[2,] "1.83"  "Rank4" "Rank2"
[3,] "2.727" "Rank3" "Rank1"
[4,] "Rank5" "Rank1" "Rank4"
[5,] "1.553" "Rank2" "Rank5"
ranked.values[,1] <- gsub("Rank5",target["Rank5"],ranked.values[,1])
ranked.values
     Array1  Array2  Array3 
[1,] "3.347" "Rank5" "Rank3"
[2,] "1.83"  "Rank4" "Rank2"
[3,] "2.727" "Rank3" "Rank1"
[4,] "3.75"  "Rank1" "Rank4"
[5,] "1.553" "Rank2" "Rank5"
for(i in 1:3){
ranked.values[,i] <- gsub("Rank1",target["Rank1"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank2",target["Rank2"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank3",target["Rank3"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank4",target["Rank4"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank5",target["Rank5"],ranked.values[,i])
}
ranked.values <- as.data.frame(ranked.values)
rownames(ranked.values) <- rownames(raw.values)
raw.values
ranked.values

Final Code

The procedure can be encapsulated in relatively few lines of code

ranked.values <- apply(raw.values, 2, function(x) paste("Rank",
      rank(x,ties.method="min"),sep=""))
target <- round(rowMeans(apply(df, 2,sort)),3)
names(target) <- paste("Rank", 1:length(target),sep="")
for(i in 1:ncol(raw.values)){
  for(nm in names(target)){
    ranked.values[,i] <- gsub(nm,target[nm],ranked.values[,i])  
      }
}
norm <- as.data.frame(ranked.values)

We can use the normalizeQuantiles function within the limma package to do the normalisation.

library(limma)
?normalizeQuantiles

Caveats

  • Distributions of samples are expected to be the same
  • Majority of genes do not change between groups
    • might be violated if you have samples that are dramatically altered
    • e.g. low number of genes in a screening panel that are expected to change

Other normalisation procedures

  • loess or splines
    • need a reference or “average” array to compare
    • estimate the curve and use to correct each point
library(affy)
?normalize.loess

Running RMA on our dataset

The affy package provides lots of different ways to process raw Affymetricx data, one of which is an implementation of RMA. For other methods, see the expresso function or package Vignette

?expresso
vignette("affy")

The result is an ExpressionSet, which is a ubiquitous object in Biooconductor for storing high-throughput data. Later-on in the course will we import data from public repositories, and these data will often be in a normalised form.

eset <- rma(raw)
Background correcting
Normalizing
Calculating Expression
eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 8 samples 
  element names: exprs 
protocolData
  sampleNames: low10-1.cel low10-2.cel ... high48-2.cel (8 total)
  varLabels: ScanDate
  varMetadata: labelDescription
phenoData
  sampleNames: low10-1.cel low10-2.cel ... high48-2.cel (8 total)
  varLabels: estrogen time.h
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: hgu95av2 

The ExpressionSet is a convenient object-type for storing multiple high-dimensional data frames. The structure of the object is much more complicated than other types of data we have seen before. In fact we do not need to know need exactly how the object is constructed internally. The package authors have provided convenient functions that will allow us to access the data.

For example, if we want to extract the expression measurements themselves the function to use is called exprs. The result will be a matrix with one row for each gene, and one column for each sample.

head(exprs(eset))
          low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel low48-2.cel high48-1.cel
100_g_at     9.642896    9.741496     9.537036     9.353625    9.591697    9.570590     9.475796
1000_at     10.398169   10.254362    10.003971     9.903528   10.374866   10.033520    10.345066
1001_at      5.717613    5.881008     5.859563     5.954028    5.960540    6.020889     5.981080
1002_f_at    5.512596    5.801807     5.571065     5.608132    5.390064    5.494511     5.508104
1003_s_at    7.783927    8.007975     8.037999     7.835120    7.926487    8.138870     7.994937
1004_at      7.289162    7.603670     7.488539     7.771506    7.521789    7.599544     7.456149
          high48-2.cel
100_g_at      9.530655
1000_at       9.863321
1001_at       6.285192
1002_f_at     5.630107
1003_s_at     8.233338
1004_at       7.675171
summary(exprs(eset))
  low10-1.cel      low10-2.cel      high10-1.cel     high10-2.cel     low48-1.cel    
 Min.   : 4.269   Min.   : 4.280   Min.   : 4.294   Min.   : 4.135   Min.   : 4.238  
 1st Qu.: 5.601   1st Qu.: 5.628   1st Qu.: 5.592   1st Qu.: 5.621   1st Qu.: 5.630  
 Median : 7.108   Median : 7.123   Median : 7.108   Median : 7.123   Median : 7.102  
 Mean   : 7.295   Mean   : 7.295   Mean   : 7.293   Mean   : 7.293   Mean   : 7.291  
 3rd Qu.: 8.600   3rd Qu.: 8.592   3rd Qu.: 8.616   3rd Qu.: 8.607   3rd Qu.: 8.588  
 Max.   :14.736   Max.   :14.763   Max.   :14.765   Max.   :14.733   Max.   :14.750  
  low48-2.cel      high48-1.cel     high48-2.cel   
 Min.   : 4.306   Min.   : 4.273   Min.   : 4.327  
 1st Qu.: 5.636   1st Qu.: 5.591   1st Qu.: 5.626  
 Median : 7.094   Median : 7.063   Median : 7.078  
 Mean   : 7.267   Mean   : 7.287   Mean   : 7.257  
 3rd Qu.: 8.570   3rd Qu.: 8.598   3rd Qu.: 8.571  
 Max.   :14.761   Max.   :14.706   Max.   :14.716  

You will notice the rma also incorporates a log\(_2\) transformation. The values present in the cel files are derived from processing the TIFF images of the chip surface, which are on the scale 0 - 2^16. However, this is not a very convenient scale for visualisation and analysis as most of the observations are found at the lower end of the scale and there is a much greater variance at the higer values. We say that the data exhibit ‘heteroscedasticity’. To alleviate this we can use a data transformation such as log\(_2\) and afterwards our data will be in the range 0 to 16. Another method you may come across is vsn.

boxplot(exprs(eset),las=2)


How can you tell from the boxplot that these data have been normalised?


An MA-plot can be repeated on the normalised data to verify that the normalisation procedure has been sucessful.

mva.pairs(exprs(eset)[,1:4],log.it = FALSE,plot.method="smoothScatter")

If we have used a targets file to read the raw data, this information will stored in the summarised object. Commonly-referred to as “pheno data”, this can retrieved using the pData function. A nice property of this data frame is that the rows are arranged in the same order as the columns of the expression matrix. i.e. The first column of the expression matrix is described in the first row of the pheno data matrix, and so on.

head(pData(eset))
colnames(exprs(eset))
[1] "low10-1.cel"  "low10-2.cel"  "high10-1.cel" "high10-2.cel" "low48-1.cel"  "low48-2.cel" 
[7] "high48-1.cel" "high48-2.cel"
rownames(pData(eset))
[1] "low10-1.cel"  "low10-2.cel"  "high10-1.cel" "high10-2.cel" "low48-1.cel"  "low48-2.cel" 
[7] "high48-1.cel" "high48-2.cel"

Clustering techniques and PCA that we will meet later in the course can also inform decisions about experiment quality by providing a way of visualing relationships between sample groups.

Automated production of QC plots

The arrayQualityMetrics package is able to produce QA plots from set of cel files, or a normalised dataset.

library(arrayQualityMetrics)
arrayQualityMetrics(eset)

Thoughts on Quality Assessment

  • It is difficult to come up with definitive rules about when to reject a sample from the analysis
  • Try and keep as much information from the lab
    • RIN (RNA Integrity Numbers), tumour composition
  • If we try and reject samples for looking different, we might bias our results and miss interesting findings
  • If the majority of our samples are poor quality, the better quality ones might stand-out as outliers!
  • Try and collect as many samples / replicates as possible to be tolerant to variation

Summary

  • Affy data come in .cel files that can be imported with the affy package
  • QC checks on the raw data include;
    • check the chip image
    • probe-level models
    • boxplots
  • Affy data need to be summarised before further analysis
    • rma (and variants thereof) is most-popular
  • Affy data can be summarised into a common Bioconductor object-type
    • The “ExpressionSet
  • The ExpressionSet, or derivaties thereof, is used throughout Bioconductor to represent all types of high-throughput data
    • methylation, ChIP, RNA-seq
    • basically anything where you can have some kind of genomic measure as rows, and samples as columns
---
title: "The Affymetrix Analysis Workflow"
author: Mark Dunning, based on materials by Benilton Carvalho
output: 
  html_notebook: 
    toc: yes
    toc_float: yes
---


![](https://upload.wikimedia.org/wikipedia/commons/2/22/Affymetrix-microarray.jpg)

# Technology overview

Affymetrix microarrays are a popular commercial platform available for a wide range of
genomics applications (gene expression profiling, SNP genotyping, ChIP-chip analysis etc.)
in different species.
The main distinction between Affymetrix and other array technologies is the use of many
short (25mer) probes to measure hybridisation.
In this practical, we explore the basic analysis issues for Affymetrix GeneChips which measure gene expression using multiple (11-20) perfect match (PM) and mismatch (MM) probes
concentrated in the 3’ region of each transcript. 

![](images/pm.png)

Despite our focus on expression arrays, many
of the processing steps, such as quality assessment and normalisation are equally important
for other applications of Affymetrix technology to ensure that the signal is comparable between arrays.
In this practical we use several Bioconductor packages ([affy](http://bioconductor.org/packages/release/bioc/html/affy.html), [affyPLM](http://bioconductor.org/packages/release/bioc/html/affyPLM.html), [limma](http://bioconductor.org/packages/release/bioc/html/limma.html), etc.) to read in
the raw data, assess data quality and normalise an Affymetrix experiment. Raw data for Affymetrix arrays are usually stored in [`.cel`](http://dept.stat.lsa.umich.edu/~kshedden/Courses/Stat545/Notes/AffxFileFormats/cel.html) format. We wouldn't normally wish to view these files directly. These files contain the intensities of each "cell" on chip surface after image analysis. 

  - Each hybridisation has a unique *cel* file
  - It is know in advance which probe / probe set is located within each cell
  - Each chip type (e.g.hgu95av2) has a *cdf* (chip description) file which is required to decode which probe and probe set is within each cell.

# Example analysis in R

## The estrogen dataset

The experiment we will analyse is made-up of eight Affymetrix *HGU95Av2*
GeneChips. The aim of the experiment is briefly described below (excerpt taken from the
factDesign package vignette).

> “The investigators in this experiment were interested in the effect of estrogen on the genes in ER+ breast cancer cells over time. After serum starvation of all eight samples, they exposed
four samples to estrogen, and then measured mRNA transcript abundance after 10 hours for
two samples and 48 hours for the other two. They left the remaining four samples untreated,
and measured mRNA transcript abundance at 10 hours for two samples, and 48 hours for
the other two. Since there are two factors in this experiment (estrogen and time), each at
two levels (present or absent, 10 hours or 48 hours), this experiment is said to have a 2 × 2
factorial design.”

The data for this section are described in the [estrogen data package](http://www.bioconductor.org/packages/release/data/experiment/html/estrogen.html) in Bioconductor

The cel files for the example experiement are stored in the `estrogen` directory

The first stage is to read the *targets* file. In Bioconductor we often use a targets file to define the files corresponding to the raw data for the experiment and which sample groups they belong to. Such a file can be created in a spreadsheet, or text editor, and usually saved as a tab-delimited file. We can have as many columns in the file as we see fit, and one row for each hybridisation. The sample group information is propogated through the analysis and incorporated in the quality assessment and eventually differential expression. We refer to this as the "phenotype data" for the experiment, or sometimes the *"metadata"*.

### Packages and data 

If you did not install the `affy` Bioconductor package, you will need to do so now:-

```{r eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite("affy")
```

The data for the practical can be found in the [course zip file](https://rawgit.com/bioinformatics-core-shared-training/microarray-analysis/master/Course_Data.zip)

```{r}
library(affy)
targetsFile <- "estrogen/estrogen.txt"
targetsFile
pd <- read.AnnotatedDataFrame(targetsFile,header=TRUE,sep="",row.names=1)
pData(pd)
```



As `.cel` files are not a typical file format, we cannot use the standard R functions such as `read.delim`, `read.csv` and `read.table`. The function to import the data is `ReadAffy` from the [affy](http://bioconductor.org/packages/release/bioc/html/affy.html) package. 

```{r}

raw <-ReadAffy(celfile.path = "estrogen", filenames=rownames(pData(pd)),phenoData = pd)
raw
```

******
What type of Affy array is this
How many features?
How many samples?

******




## Diagnostic plots 

As with other high-throughput technologies, quality assessment is an important part of the
analysis process. Data quality can be checked using various diagnostic plots.

The first diagnostic plot we will meet is the "[boxplot](https://en.wikipedia.org/wiki/Box_plot)", which is a commonly-used plot in data analysis for *comparing data distributions*. The exact definition of how is boxplot is drawn can vary, the definitions always hold;

- The bottom of the box is the *first quartile*; the 25th percentile
- The top of the box is the *third quartile*; the 75th percentile
- The median is represented by a horizontal line
- The *inter-quartile range* (IQR) is the difference between the 75th and 25th percentiles

Here, we draw a boxplot to compare two distributions; `x` with a mean of 0 and standard deviation of 1, and `y` with a mean of 2 and standard deviation of 4.

```{r}
x <- rnorm(100)
y <- rnorm(100, mean = 2,sd=4)
boxplot(x,y)
```

It is also common to draw whiskers that extend to 1.5 X the IQR from the lower and upper quartile. Any points outside this range can be represented by a dot.


In Bioconductor, you will usually find that package authors have created shortcuts to allow complicated data types to be visualised with common functions. For instance, if we want to construct a boxplot from our raw Affymetrix data it would be quite a daunting task for the novice to extract the revelant information from the object. Instead, we can use functions that we are familiar with such as `boxplot`. The plot can also be customised in ways that we are familiar with such as changing the color (the `col` argument) and label orientation (`las=2` to make labels perpendicular to the axis)

```{r}
boxplot(raw,col="red",las=2)
```

******

### What do you notice from this plot?

******


## Perfect-Match and Mismatch probes

With the short DNA sequences used for microarray hybridisation, there is the possibility for the sequence designed to not be specific-enough and hybridise to many regions of the genome. Affymetrix attempt to resolve this issue by having a mismatch sequence corresponding to each probe. Ideally, the subtracting the mismatch from the corresponding perfect match( $PM - MM$) should yield a background-corrected, and more-reliable, signal.


******

Generate histograms of the PM and MM intensities from the first array. Do you
notice any difference in the signal distribution of the PMs and MMs?

******

```{r}
par(mfrow=c(2,1))
hist(log2(pm(raw[,1])),breaks=100,col="steelblue",main="PM",xlim=c(4,14))
hist(log2(mm(raw[,1])),breaks=100,col="steelblue",main="MM",xlim=c(4,14))

```



***M-A*** plots are a useful way of comparing the red and green channels from a two-colour
microarray. For Affymetrix data, which is a single-channel technology, M and A values can
be calculated using the intensities from a pair of chips. We would typically produce a plot using samples from the same biological group, where we would not expect to observe much difference in intensity for a given gene.

i.e. if X$_i$ is the intensity of a given
probe from chip $i$ and X$_j$ is the intensity for the same probe on chip $j$, then 
$A = 1/2 (log_2(X_i) + log_2(X_j))$ and $M = log_2(X_i) − log_2(X_j)$ . In this experiment, there are 8 GeneChips, which gives
28 distinct pair-wise comparisons between arrays. We will focus on a subset of these below.

```{r}
mva.pairs(pm(raw)[,1:4],plot.method="smoothScatter")
mva.pairs(pm(raw)[,5:8],plot.method="smoothScatter")
```


******
Based on all the plots you have generated, what would you conclude about the overall quality of this experiment? 
******



## Probe-level Linear Models

Probe-level Linear Models (PLMs) can be used as an additional tool to assess relative
data quality within an experiment. Many different model specifications are possible, with
the simplest fitting chip, and probe effects to the log$_2$ intensities within each probeset across
an experiment in a robust way.
The output is a matrix of residuals, or weights for each chip which can be used as an additional diagnostic; systematically high residuals across an entire array, or a large fraction of an array is indicative of an outlier array. The main use of this tool is in deciding whether or
not to keep an array in the down-stream data analysis.

### Relative Log Expression (RLE)

The Relative Log Expression (RLE) values are computed by calculating for each probe-set the ratio between the expression of a probe-set and the median expression of this probe-set across all arrays of the experiment. It is assumed that most probe-sets are not changed across the arrays, so it is expected that these ratios are around 0 on a log scale. The boxplots presenting the distribution of these log-ratios should then be centered near 0 and have similar spread. Other behavior would be a sign of low quality.


### Normalized Unscaled Standard Error (NUSE) 

The Normalized Unscaled Standard Error (NUSE) is the individual probe error fitting the Probe-Level Model (the PLM models expression measures using a M-estimator robust regression). The NUSE values are standardized at the probe-set level across the arrays: median values for each probe-set are set to 1. The boxplots allow checking (1) if all distributions are centered near 1 (typically an array with a boxplot centered around 1.1 shows bad quality) and (2) if one array has globally higher spread of NUSE distribution than others, which may also be a sign of low quality.

```{r}
library(affyPLM)
plmset <- fitPLM(raw)
NUSE(plmset,las=2)
RLE(plmset,las=2)
```


We can look at the images of the array surface. Ocassionally this can reveal problems on the array. See this [gallery](http://plmimagegallery.bmbolstad.com/) of interesting examples. Affymetrix, and other older arrays are vunerable to spatial artefacts as the same probe set is found in the same location on each chip. 

If you want to look at an example of a bad array image, we can use the following code;

```{r}
bad <- ReadAffy(celfile.path = "estrogen/",filenames="bad.cel")
image(bad)
```

Try out some images for this dataset

```{r}
par(mfrow=c(2,4))
image(raw[,1])
image(raw[,2])
image(raw[,3])
image(raw[,4])
image(raw[,5])
image(raw[,6])
image(raw[,7])
image(raw[,8])
```


- Do you see any problems?

If we are happy with the quality of the raw data we can proceed to the next step. So far in the data, we have multiple probes within a probeset, and a perfect and mismatch measurement for each probe. These are not particular covenient values for statistical analysis as we would like a *single measurement* for each gene for each sample (/hybridisation). 

## Summarising and Normalising the Estrogen dataset

Many normalisation and summarisation methods have been developed for Affymetrix
data. These include MAS5.0 and PLIER which were developed by Affymetrix, and RMA,
GC-RMA, dChip and vsn (to name but a few) which have been developed by academic researchers.
Many of these methods are available in the [affy](http://bioconductor.org/packages/release/bioc/html/affy.html) package. For a comparison of some of these
methods and assessment of their performance on different control data sets, see [Bolstad et al](http://bioinformatics.oxfordjournals.org/content/19/2/185.long) or [Millenaar et al](http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-7-137). In
this practical we will use the ***RMA*** (Robust Multichip Averaging) method described in [Irizarry et al](http://www.ncbi.nlm.nih.gov/pubmed/12925520).

Procedure;

- model based background adjustment 
- followed by quantile normalisation and a 
- robust summary method (median polish) on the log$_2$ PM intensities 
- to obtain *probeset summary values*.

## Normalisation


- We want to be observing *biological* and not *technical* variation
- We wouldn't expect such wholesale changes on a per-sample basis
- Easy option would to scale values for each array to median level

- **What would happen if we had hybridised all the *estrogen-treated* samples on the first four arrays, and *un-treated* on the last four** - Could you analyse the data?
![](images/linear-effects.png)


- Genes on array 2 are on average 2.6 lower than the global median, so add 2.6 to each gene
- Genes on array 8 are on average 0.7 higher than the global median, so subtract 0.7 from each gene
- etc



## Non-linear effects

As we saw before The *MA-plot* is commonly-used in microarray analysis to commonly-used to visualise differences between pairs of arrays. These plots can reveal non-linear effects and motivate more-sophisticated methods. On older array technologies, it was typical to see a banana-shaped trend on these plots.

(***not our dataset***)

![](images/non-linear.png)


## Quantile normalisation

This is arguably the most-popular normalisation method available. Consider the following matrix of values to be normalised

```{r echo=FALSE}
# Set a seed to make sure we get the same matrix
set.seed("15022016")
raw.values <- round(data.frame(Array1 = rnorm(5,mean = 2), Array2 = rnorm(5,mean=3), Array3 = rnorm(5,mean=4)),2)
rownames(raw.values) <- LETTERS[1:nrow(raw.values)]
raw.values
```


```{r echo=FALSE}
ranked.values <- apply(raw.values, 2, function(x) paste("Rank",rank(x,ties.method="min"),sep=""))
ranked.values
```

Sort each column Smallest...Largest

Sorted data
```{r}
apply(raw.values, 2,sort)
```

Then calculate target distribution by averaging the sorted rows
```{r echo=FALSE} 
target <- round(rowMeans(apply(raw.values, 2,sort)),3)
names(target) <- paste("Rank", 1:length(target),sep="")
target
```

Go back to the rank matrix

```{r echo=FALSE}
ranked.values

```
Substitute with values from the target distribution
```{r echo=FALSE}
target
```

```{r}
ranked.values[,1] <- gsub("Rank1",target["Rank1"],ranked.values[,1])
ranked.values
ranked.values[,1] <- gsub("Rank2",target["Rank2"],ranked.values[,1])
ranked.values
ranked.values[,1] <- gsub("Rank3",target["Rank3"],ranked.values[,1])
ranked.values
ranked.values[,1] <- gsub("Rank4",target["Rank4"],ranked.values[,1])
ranked.values
ranked.values[,1] <- gsub("Rank5",target["Rank5"],ranked.values[,1])
ranked.values
```


```{r}
for(i in 1:3){
ranked.values[,i] <- gsub("Rank1",target["Rank1"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank2",target["Rank2"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank3",target["Rank3"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank4",target["Rank4"],ranked.values[,i])
ranked.values[,i] <- gsub("Rank5",target["Rank5"],ranked.values[,i])
}
ranked.values <- as.data.frame(ranked.values)
rownames(ranked.values) <- rownames(raw.values)

```

```{r}
raw.values
ranked.values
```


##Final Code

The procedure can be encapsulated in relatively few lines of code

```{r eval=FALSE}

ranked.values <- apply(raw.values, 2, function(x) paste("Rank",
      rank(x,ties.method="min"),sep=""))
target <- round(rowMeans(apply(df, 2,sort)),3)
names(target) <- paste("Rank", 1:length(target),sep="")
for(i in 1:ncol(raw.values)){
  for(nm in names(target)){
    ranked.values[,i] <- gsub(nm,target[nm],ranked.values[,i])  
      }
}
norm <- as.data.frame(ranked.values)
```



We can use the `normalizeQuantiles` function within the `limma` package to do the normalisation. 

```{r message=FALSE}
library(limma)
?normalizeQuantiles
```

### Caveats

- Distributions of samples are expected to be the same
- Majority of genes do not change between groups
    + might be violated if you have samples that are dramatically altered
    + e.g. low number of genes in a screening panel that are expected to change
    
## Other normalisation procedures

- loess or splines
    + need a reference or "average" array to compare 
    + estimate the curve and use to correct each point
    

```{r}
library(affy)
?normalize.loess
```


## Running RMA on our dataset

The affy package provides lots of different ways to process raw Affymetricx data, one of which is an implementation of RMA. For other methods, see the `expresso` function or package *Vignette*


```{r eval=FALSE}
?expresso
vignette("affy")
```

The result is an `ExpressionSet`, which is a ubiquitous object in Biooconductor for storing high-throughput data. Later-on in the course will we import data from public repositories, and these data will often be in a normalised form.

```{r}
eset <- rma(raw)
eset
```

The `ExpressionSet` is a convenient object-type for storing multiple high-dimensional data frames. The structure of the object is much more complicated than other types of data we have seen before. In fact we do not need to know need exactly *how* the object is constructed internally. The package authors have provided convenient functions that will allow us to access the data. 

For example, if we want to extract the expression measurements themselves the function to use is called `exprs`. The result will be a matrix with one row for each gene, and one column for each sample.

```{r}
head(exprs(eset))
summary(exprs(eset))
```

You will notice the *rma* also incorporates a log$_2$ transformation. The values present in the *cel* files are derived from processing the *TIFF* images of the chip surface, which are on the scale 0 - 2^16. However, this is not a very convenient scale for visualisation and analysis as most of the observations are found at the lower end of the scale and there is a much greater variance at the higer values. We say that the data exhibit ['heteroscedasticity'](https://en.wikipedia.org/wiki/Heteroscedasticity). To alleviate this we can use a data transformation such as log$_2$ and afterwards our data will be in the range 0 to 16. Another method you may come across is [vsn](http://bioinformatics.oxfordjournals.org/content/18/suppl_1/S96.long). 


```{r}
boxplot(exprs(eset),las=2)
```


******
### How can you tell from the boxplot that these data have been normalised?
******

An MA-plot can be repeated on the normalised data to verify that the normalisation procedure has been sucessful.

```{r}
mva.pairs(exprs(eset)[,1:4],log.it = FALSE,plot.method="smoothScatter")
```



If we have used a *targets* file to read the raw data, this information will stored in the summarised object. Commonly-referred to as "pheno data", this can retrieved using the `pData` function. A nice property of this data frame is that the *rows* are arranged in the same order as the *columns* of the expression matrix. i.e. The first column of the expression matrix is described in the first row of the pheno data matrix, and so on. 

```{r}
head(pData(eset))
colnames(exprs(eset))
rownames(pData(eset))
```


Clustering techniques and PCA that we will meet later in the course can also inform decisions about experiment quality by providing a way of visualing relationships between sample groups.


![](images/affy-clustering.png)

![](images/affy-pca.png)


## Automated production of QC plots

The `arrayQualityMetrics` package is able to produce QA plots from set of cel files, or a normalised dataset.

```{r eval=FALSE}
library(arrayQualityMetrics)
arrayQualityMetrics(eset)
```

### Thoughts on Quality Assessment 

- It is difficult to come up with definitive rules about when to reject a sample from the analysis
- Try and keep as much information from the lab
    + RIN (RNA Integrity Numbers), tumour composition
- If we try and reject samples for looking different, we might bias our results and miss interesting findings
- If the majority of our samples are poor quality, the better quality ones might stand-out as outliers!
- Try and collect as many samples / replicates as possible to be tolerant to variation

## Summary

- Affy data come in `.cel` files that can be imported with the `affy` package
- QC checks on the raw data include;
    + check the chip image
    + probe-level models
    + boxplots
- Affy data need to be summarised before further analysis
    + `rma` (and variants thereof) is most-popular
- Affy data can be summarised into a common Bioconductor object-type
    + The "`ExpressionSet`"
- The `ExpressionSet`, or derivaties thereof, is used throughout Bioconductor to represent all types of high-throughput data
    + methylation, ChIP, RNA-seq
    + basically anything where you can have some kind of genomic measure as rows, and samples as columns
